40 Interview Questions On Statistics For Data Scientists

Rohit Garg

We frequently publish resources for aspirants and job seekers in data science to help them build a career in this vibrant field. Cracking interviews, especially those where an understanding of statistics is needed, can be tricky. Here are 40 of the most commonly asked interview questions for data scientists, broken into basic and advanced.

Here are some other interview question resources for data scientists.

10 Most Common SQL Questions & Answers You Must Know For Your Next Interview



10 Frequently Asked Interview Questions For Machine Learning In 2019

5 Mathematical Concepts Every Data Scientist Should Master Before An Interview


10 Important Pandas Interview Questions Every Beginner Must Know

11 Most Commonly Asked NLP Interview Questions For Beginners

12 Most Popular Python Interview Questions You Must Prepare For

10 Most Frequently Asked Questions In Data Science Interview

Top Interview Questions For A Data Engineer Job Profile


Part 1 – Basic Statistics and Distributions

20 Questions

  1. What is the difference between data analysis and machine learning?

Data analysis requires strong knowledge of coding and basic knowledge of statistics

Machine learning, on the other hand, requires basic knowledge of coding and strong knowledge of statistics and business.


2. What is big data?

Big data has 3 major components – volume (size of data), velocity (inflow of data) and variety (types of data)

Big data causes “overloads”


3. What are the four main things we should know before studying data analysis?

Descriptive statistics

Inferential statistics

Distributions (normal distribution / sampling distribution)

Hypothesis testing


4. What is the difference between inferential statistics and descriptive statistics?

Descriptive statistics – summarise the data we actually have, so the information is exact and accurate for that data.

Inferential statistics – use the information from a sample to reach a conclusion about the population.


5. What is the difference between population and sample in inferential statistics?

From the population we take a sample. We cannot work on the whole population, either because of computational costs or because not all data points for the population are available.

From the sample we calculate the statistics

From the sample statistics we conclude about the population


6. What are descriptive statistics?

Descriptive statistics are used to describe the data (data properties)

The 5-number summary is the most commonly used descriptive statistic


7. Most common characteristics used in descriptive statistics?

  • Center – middle of the data. Mean / Median / Mode are the most commonly used as measures.
    • Mean – average of all the numbers
    • Median – the number in the middle
    • Mode – the number that occurs the most. The disadvantage of using Mode is that there may be more than one mode.
  • Spread – How the data is dispersed. Range / IQR / Standard Deviation / Variance are the most commonly used as measures.
    • Range = Max – Min
    • Inter Quartile Range (IQR) = Q3 – Q1
    • Standard Deviation (σ) = √(∑(x – µ)² / n)
    • Variance = σ²
  • Shape – the shape of the data can be symmetric or skewed
    • Symmetric – the part of the distribution that is on the left side of the median is same as the part of the distribution that is on the right side of the median
    • Left skewed – the left tail is longer than the right tail
    • Right skewed – the right tail is longer than the left tail
  • Outlier – An outlier is an abnormal value
    • Keep the outlier based on judgement
    • Remove the outlier based on judgement
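The center and spread measures listed above can be computed with Python's standard `statistics` module. This is only a minimal sketch: the values in `data` are made up for illustration, and the population formulas (dividing by n) are used to match the standard deviation formula given above.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample values, for illustration only

center = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    # multimode returns every mode, because there may be more than one
    "modes": statistics.multimode(data),
}
spread = {
    "range": max(data) - min(data),
    # population formulas: divide by n, as in sqrt(sum((x - mu)^2) / n)
    "std_dev": statistics.pstdev(data),
    "variance": statistics.pvariance(data),
}

print(center)
print(spread)
```

For a sample rather than a full population, `statistics.stdev` and `statistics.variance` divide by n – 1 instead.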

8. What is quantitative data and qualitative data?

Quantitative data is also known as numeric data

Qualitative data is also known as categorical data


9. How to calculate range and interquartile range?

Range = Max – Min

IQR = Q3 – Q1

Where Q3 is the third quartile (75th percentile)

and Q1 is the first quartile (25th percentile)


10. Why do we need the 5-number summary?

It describes the centre, the spread and the extremes of the data with just five values:

Low extreme (minimum)

Lower quartile (Q1)

Median

Upper quartile (Q3)

Upper extreme (maximum)
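As a rough sketch, the five values and the IQR can be computed with the standard library. The helper name `five_number_summary` and the data are hypothetical. Quartile conventions also differ between tools, so the "inclusive" method used here is one choice among several.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    # n=4 asks for the three cut points that split the data into quarters.
    # method="inclusive" is one quartile convention; others give slightly
    # different Q1 and Q3 on small data sets.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up values, for illustration only
low, q1, median, q3, high = five_number_summary(data)
iqr = q3 - q1  # IQR = Q3 - Q1

print(low, q1, median, q3, high, iqr)
```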


11. What is the benefit of using box plot?

Shows the 5-number summary pictorially

Can be used to compare several groups side by side (more compactly than a group of histograms)


12. What is the meaning of standard deviation?

It represents how far the data points are, on average, from the mean

σ = √(∑(x – µ)² / n)

Variance is the square of the standard deviation


13. What is left skewed distribution and right skewed distribution?

  • Left skewed
    • The left tail is longer than the right side
    • Mean < median < mode
  • Right skewed
    • The right tail is longer than the left tail
    • Mode < median < mean

14. What does symmetric distribution mean?

The part of the distribution that is on the left side of the median is same as the part of the distribution that is on the right side of the median

A few examples are – uniform distribution, binomial distribution (when p = 0.5), normal distribution


15. What is the relationship between mean and median in normal distribution?

In the normal distribution mean is equal to median


16. What does it mean by bell curve distribution and Gaussian distribution?

Normal distribution is called bell curve distribution / Gaussian distribution

It is called bell curve because it has the shape of a bell

It is called Gaussian distribution as it is named after Carl Friedrich Gauss


17. How to convert normal distribution to standard normal distribution?

Standardized normal distribution has mean = 0 and standard deviation = 1

To convert normal distribution to standard normal distribution we can use the formula

X (standardized) = (x-µ) / σ
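The formula can be checked with a few lines of Python. This is only a sketch: the function name `standardize` and the `heights` values are assumptions made for illustration, and the population standard deviation is used.

```python
import statistics

def standardize(values):
    """Apply (x - mu) / sigma to every value, using population mu and sigma."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(x - mu) / sigma for x in values]

heights = [150, 160, 170, 180, 190]  # made-up values, for illustration only
z_scores = standardize(heights)

# After standardizing, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding).
print(statistics.mean(z_scores), statistics.pstdev(z_scores))
```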


18. What is an outlier?

An outlier is an abnormal value (It is at an abnormal distance from rest of the data points). 


19. Mention one method to find outliers?

The 5-number summary can be used to identify outliers

Widely used – any data point that lies more than 1.5 * IQR below Q1 or above Q3

Lower bound = Q1 – (1.5 * IQR)

Upper bound = Q3 + (1.5 * IQR)
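A minimal sketch of the 1.5 * IQR rule, using only the standard library. The function name `iqr_outliers` and the data are hypothetical, and the "inclusive" quartile method is an assumption; other quartile conventions can move the bounds slightly.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (lower bound, upper bound, outliers) using the k * IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr  # Lower bound = Q1 - (1.5 * IQR)
    upper = q3 + k * iqr  # Upper bound = Q3 + (1.5 * IQR)
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

data = [1, 2, 3, 4, 5, 6, 7, 8, 50]  # made-up values with one extreme point
lower, upper, outliers = iqr_outliers(data)
print(lower, upper, outliers)
```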


20. What can I do with an outlier?

  • Remove outlier
    • When we know the data-point is wrong (negative age of a person)
    • When we have lots of data
    • We should provide two analyses. One with outliers and another without outliers.
  • Keep outlier
    • When there are a lot of outliers (skewed data)
    • When results are critical
    • When outliers have meaning (fraud data)

Part 2 – Advanced Statistics and Hypothesis Testing

20 Questions

21. What is the difference between population parameters and sample statistics?

  • Population parameters are:
    • Mean = µ
    • Standard deviation = σ
  • Sample statistics are:
    • Mean = x (bar)
    • Standard deviation = s

22. Why do we need sample statistics?

Population parameters are usually unknown hence we need sample statistics.


23. How to find the mean length of all fishes in the sea?

Define the confidence level (most common is 95%)

Take a sample of fishes from the sea (to get better results the number of fishes > 30)

Calculate the mean length and standard deviation of the lengths

Find the t-value for the chosen confidence level and DF = n – 1

Get the confidence interval in which the mean length of all the fishes should be.
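The steps above can be sketched in Python. Two assumptions are made here. The fish lengths are simulated, since no real measurements are given. And because the standard library has no t-distribution, the critical value comes from the normal distribution, which is a reasonable approximation only for samples larger than about 30. With a statistics package, the t-value for DF = n – 1 would be used instead.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate CI for the population mean: x_bar +/- critical * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    alpha = 1 - confidence
    # Normal approximation to the t critical value; assumes n is larger than ~30.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    margin = critical * s / math.sqrt(n)
    return x_bar - margin, x_bar + margin

random.seed(0)  # fixed seed so the simulated sample is reproducible
lengths = [random.gauss(25, 5) for _ in range(40)]  # hypothetical lengths in cm

low, high = mean_confidence_interval(lengths, confidence=0.95)
print(f"95% CI for the mean length: {low:.2f} to {high:.2f}")
```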


24. What are the effects of the width of confidence interval?

  • Confidence interval is used for decision making
  • As the confidence level increases the width of the confidence interval also increases
  • As the width of the confidence interval increases, we tend to get useless information also.
    • Useless information – wide CI
    • High risk – narrow CI

25. Mention the relationship between standard error and margin of error?

As the standard error increases the margin of error also increases


26. Mention the relationship between confidence interval and margin of error?

As the confidence level increases the margin of error also increases


27. What is the proportion of confidence interval that will not contain the population parameter?

Alpha is the proportion of confidence intervals (over repeated sampling) that will not contain the population parameter

α = 1 – CL


28. What is the difference between 95% confidence level and 99% confidence level?

The confidence interval becomes wider as we move from a 95% confidence level to a 99% confidence level
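A quick way to see this is to compare the critical values that multiply the standard error. The sketch below uses the Z-distribution from the standard library, and `z_critical` is a hypothetical helper name.

```python
from statistics import NormalDist

def z_critical(confidence_level):
    """Two-sided critical z-value for a given confidence level."""
    alpha = 1 - confidence_level  # alpha = 1 - CL
    return NormalDist().inv_cdf(1 - alpha / 2)

z95 = z_critical(0.95)  # roughly 1.96
z99 = z_critical(0.99)  # roughly 2.58

# Margin of error = critical value * standard error, so for the same sample
# the 99% interval is wider than the 95% interval.
print(round(z95, 2), round(z99, 2))
```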


29. What do you mean by degree of freedom?

DF is defined as the number of options we have, that is, the number of values that are free to vary

DF is used with t-distribution and not with Z-distribution

For a series, DF = n-1 (where n is the number of observations in the series)


30. What do you think if DF is more than 30?

As DF increases the t-distribution reaches closer to the normal distribution

At low DF, we have fat tails

If DF > 30, then t-distribution is as good as normal distribution


31. When to use t distribution and when to use z distribution?

  • The following conditions must be satisfied to use Z-distribution
    • Do we know the population standard deviation?
    • Is the sample size > 30?
    • CI = x (bar) – Z*σ/√n to x (bar) + Z*σ/√n
  • Else we should use t-distribution
    • CI = x (bar) – t*s/√n to x (bar) + t*s/√n

32. What is H0 and H1? What is H0 and H1 for two-tail test?

  • H0 is known as null hypothesis. It is the normal case / default case.
    • For one tail test x <= µ
    • For two-tail test x = µ
  • H1 is known as alternate hypothesis. It is the other case.
    • For one tail test x > µ
    • For two-tail test x <> µ

33. What is p-value in hypothesis testing?

  • The p-value is the probability, assuming H0 is true, of getting a result at least as extreme as the one observed in the sample. It is compared with the significance level (α).
  • If the p-value is more than the significance level, then we fail to reject the H0
    • If p-value = 0.055 (α = 0.05) – weak evidence against H0, not enough to reject it
  • If the p-value is less than the significance level, then we reject the H0
    • If p-value = 0.015 (α = 0.05) – evidence against H0
    • If p-value = 0.005 (α = 0.05) – strong evidence against H0

34. How to calculate p-value using manual method?

Find H0 and H1

Find n, x(bar) and s

Find DF for t-distribution

Find the type of distribution – t or z distribution

Find t or z value (using the look-up table)

Compare the p-value with the significance level


35. How to calculate p-value using EXCEL?

Go to Data tab

Click on Data Analysis

Select Descriptive Statistics

Choose the column

Select summary statistics and confidence level (0.95)


36. What do we mean by – making decision based on comparing p-value with significance level?

If the p-value is more than the significance level (α), then we fail to reject the H0

If the p-value is less than the significance level (α), then we reject the H0
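The decision rule can be sketched for the simplest case, a two-tail z-test where the population standard deviation is known. All numbers below are made up for illustration, and the function names `two_tail_p_value` and `decide` are hypothetical.

```python
import math
from statistics import NormalDist

def two_tail_p_value(x_bar, mu_0, sigma, n):
    """p-value of a two-tail z-test for H0: population mean equals mu_0."""
    z = (x_bar - mu_0) / (sigma / math.sqrt(n))
    # Probability of a result at least this extreme in either tail, if H0 is true.
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Compare the p-value with the significance level alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# Made-up example: sample mean 52 from n = 36, hypothesised mean 50, sigma = 6.
p = two_tail_p_value(x_bar=52, mu_0=50, sigma=6, n=36)
print(round(p, 4), decide(p))
```

When the population standard deviation is unknown, the same comparison applies, but the p-value comes from the t-distribution with DF = n – 1.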


37. What is the difference between one tail and two tail hypothesis testing?

  • 2-tail test: Critical region is on both sides of the distribution
    • H0: x = µ
    • H1: x <> µ
  • 1-tail test: Critical region is on one side of the distribution
    • H0: x <= µ
    • H1: x > µ

38. What do you think of the tail (one tail or two tail) if H0 is equal to one value only?

It is a two-tail test


39. What is the critical value in one tail or two-tail test?

Area of the critical region in a 1-tail test = alpha

Area of the critical region in each tail of a 2-tail test = alpha / 2


40. Why is the t-value same for 90% two tail and 95% one tail test?

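P-value of 1-tail = P-value of 2-tail / 2

It is because in two tail there are 2 critical regions. At 90% two tail, alpha = 0.10 is split into 0.05 in each tail, which is the same tail area as alpha = 0.05 in a 95% one tail test.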
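The equality can be checked numerically. The standard library has no t-distribution, so this sketch uses z critical values instead; the same tail-area argument applies to t-values at any fixed DF. The helper name `critical_z` is hypothetical.

```python
from statistics import NormalDist

def critical_z(confidence_level, tails):
    """Critical z-value when alpha = 1 - CL is split across `tails` tails."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / tails)

two_tail_90 = critical_z(0.90, tails=2)  # 0.10 split into 0.05 per tail
one_tail_95 = critical_z(0.95, tails=1)  # 0.05 in a single tail

# Both leave an area of 0.05 in the upper tail, so the cut-off is the same.
print(round(two_tail_90, 3), round(one_tail_95, 3))
```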



Money transfer companies make a considerable amount of their revenue from the fees they levy on each transaction, and therefore factors like transfer rate, associated fees, and transfer speed play a vital role in whether a company thrives in the competitive world of money transfer. With banks offering attractive prices, money transfer companies have started to struggle to stay on top of their game. In a bid to stay relevant, one such veteran forex company switched to artificial intelligence-based solutions for critical data in order to tackle the changing forex landscape.



The Challenge

With the rise in advanced technologies in other parts of our lives, tech-savvy customers expect BFSI companies to deliver seamless financial experience. And that’s why BFSI companies are equipping themselves with tools that are required to stay ahead of competitors. 

A Colorado-based cross-border, cross-currency money movement and payment company (which wishes to remain anonymous) that allows transactions online as well as offline had been struggling for a long time to streamline its business processes in order to deliver enhanced customer service.


Considering the massive scale of operation, with five lakh agent locations in over 200 countries and territories, and a plethora of online payment options, the customer had to deal with numerous challenges, such as volatile market rates, competitors trying to undercut them, and a changing marketplace. An incumbent in the money transfer space, the company had completed more than 800 million transactions in 2018 and has moved over $300 billion in principal amount. 

To tackle these problems efficiently, the customer required a seamless framework that can provide detailed information on the competitors’ positioning and their pricing strategies, along with their transfer fees, transfer speed, and the exchange rate of the competitors per transaction. 

The Solution

The answer to the company’s problem was data. The customer therefore wanted an AI-based framework that could obtain high-quality, hard-to-get competitor intelligence, which Bridged agreed to provide.

Bridged is a company that provides hard to obtain human-powered data for artificial intelligence (AI) models at a scale that can create a competitive advantage and grow revenue for its customers. The company used a combination of artificial intelligence technologies and a 13,000 strong, highly skilled workforce to develop unique and vast data at a scale, significantly improving the quality of data models. Whether it be training data for machine learning models or competitive intelligence, they drew a “bridge” between the problem and solution for the customer.

The scalable solution included retail intelligence, content generation, content categorisation, and video and image tagging for companies across the world. The massive workforce captured over a million data points from top-tier competitors daily, which they converted into statistical information and leveraged it to provide insights into new market opportunities, competitor behaviour, and price optimisation, amongst many others.

Bridged has serviced numerous AI/machine learning companies with their data requirements and therefore, has become the perfect choice for the forex exchange company to deal with their challenges. With the help of Bridged’s expertise, the customer was able to receive agent store location data of its competitors for 50+ countries; and pricing data such as transfer speed, transfer rate, and the exchange rate of 13 competitors spanning 1300+ countries and 650+ currency pairs in the online and offline space. 

With the combination of technology and crowdsourcing Bridged delivered artificial intelligence as a service, providing scalable data solution for the customer. Along with a scaled process to capture real-time data on multiple competitors, Bridged also designed a fully managed process to cover multiple sources, and utilise multiple APIs, to ensure the data shared was accurate and complete. Additionally, the customer used the data provided, giving it the ability to move to a dynamic pricing structure and thus winning business.

Benefits

Being a brand that services artificial intelligence and machine learning companies with their data requirements, Bridged understood the company’s requirement for high-quality data. And therefore, it delivered 150k+ data rows per day, thus giving the customer precise insights for the company to adjust its offerings and prices. With the help of the structured data, it adjusted its offering and pricing parameters, which, in turn, increased their earnings from each transaction. The customer also enjoyed the luxury of obtaining data that they’ve struggled to find and utilise in the past.

By collaborating with Bridged, the customer was able to gain a strong data-driven understanding of the competitive landscape across geographies. Alongside, they had the privilege of receiving critical data on store locations, for its offline market, that wasn’t accessible before. This information also helped the customer to analyse into which geographical locations they can maximise their bottom-line.

In terms of return on investment, working with Bridged helped the customer to reduce its expenses by 70% and the time to access this data by 90%.

Future Prospects

The customer aims to expand its collaboration with Bridged and the scope of its requirements with a focus on capturing the location data for more countries, and the pricing data across multiple channels, such as mobile apps, etc. Besides, Bridged is looking to use its crowd for auditing the retail locations to ensure better customer service. 

Presently, they’re working on an FX product — Smart Pricing System, that has already captured 25M+ data rows from leading money transfer companies. It captures data points such as transfer fees, transfer speed, and exchange rate for various send amounts and currency corridors. The product has been designed to help money transfer companies to locate new market opportunities, track their competition, increase their revenue, price their transactions strategically, and analyse competitors’ reactions to a given change in price.

Medicine By AI


Artificial intelligence has been assisting humans in finding patterns in biological data in order to predict potential diseases. In a few cases, it is even outperforming prominent doctors in determining the ailments. However, with the latest advancements, pharma companies have turned towards AI to expedite the drug discovery. Sooner rather than later, we will be witnessing medicines developed by AI that are used on humans to cure various disorders. Although it can be compelling even to imagine the integration of AI in drug discovery, a few experts are critical of such methodologies.



AI Drug & Its Use On Human

Exscientia, a British startup, along with Japanese pharmaceutical firm Sumitomo Dainippon Pharma, has invented a drug using AI methodologies that will be used on patients who have obsessive-compulsive disorder (OCD). Drug development and clinical tests usually take more than four and a half years before a drug can be supplied to the market. However, the firm invented this new drug in twelve months. “Direct use of AI in the creating of a new medicine is a key milestone in drug discovery,” said Prof Andrew Hopkins, CEO of Exscientia. “We have seen the deployment of AI in clinics for scanning the reports of patients but not in drug development.”

The molecule DSP-1181 was created by utilising algorithms that sifted through potential compounds, checking them against a vast database of parameters. This collaborative effort from both companies played an essential role in making this breakthrough as Sumitomo Dainippon Pharma provided its experience in monoamine GPCR drug, and Exscientia contributed in applying its Centaur Chemist (TM) artificial intelligence (AI) platform for drug discovery. “This year was the first to have an AI-designed drug, but by the end of the decade all new drugs could potentially be created by AI,” said Prof Hopkins.


Phase I of the clinical study of DSP-1181 has been initiated in Japan. It is a critical step for the AI drug, as the results will be crucial for comparison with traditional medicines that often fail in the first phase; only 15% of drugs advance from stage 2 for approval.

Can We Leave Our Health On The Hands Of AI?

“Silicon Valley mindset can be dangerous for clinicians. Today, we live by the attitude that when lives are at stake, we embrace promising new ideas as quickly as possible. However, this has got us into a cancer-screening mess,” said Vinay Prasad, author of Ending Medical Reversal.

AI, even in clinical trials, has failed, as it is not as useful as companies portray it to be. Technology firms mostly rely on closed-environment tests and fail to deliver in the real world. For one, Google’s study on identifying breast cancer reported better results than doctors and was believed to be a breakthrough, but it was later considered ineffective as it failed to answer various questions. Today, firms have adopted “move fast and break things,” a motto of Facebook in its early years, which has led companies to ignore many essential questions while developing products related to health.

Shortage of information and lack of diversity in data are among the biggest challenges that health tech firms have failed to mitigate. Such problems led IBM in 2019 to pull back on AI drug discovery. “There’s a lot of hype being talked in the business literature about AI-based health tools, but are ineffectual,” said Elsevier’s. Nevertheless, AI has the potential to revolutionise the way we diagnose ailments, but it is still in a nascent stage.

Outlook

AI for research in healthcare is a wise step, but until bias is mitigated and explainability is achieved in ML models, it might just be too early to use medicines made by AI in the real world. “AI can democratise healthcare in ways we can only dream of by allowing equal care for all. However, it still needs to mature,” said Jose Morey, MD, a physician and an AI expert. Undoubtedly, AI will be widely accepted in the coming years, which will not only help drug companies save on operations but also accelerate drug discovery.

Embrace Facial Recognition


Facial recognition technology has become the epicentre for outrage among people due to privacy concerns. While some governments consider it as a threat to civilian rights, others justify the deployment of facial recognition technologies to tackle crimes. Deemed as a surveillance tool, the European Union is considering a five-year ban on the implementation of facial recognition to avoid use in public areas like streets, railway stations, and more. 



However, countries such as India and China are actively embracing facial recognition technology. While India’s home ministry announced its intention to install the world’s largest automated facial recognition software (AFRS), China is using it to monitor activities in public areas. As per research, facial recognition technology is expected to grow and reach $9.6 billion by 2020.

Why Facial Recognition Is On The Spotlight Again

Sundar Pichai, CEO of Alphabet, supported the potential temporary ban on facial recognition by the EU at more than one conference. “I think it is important that governments and regulations tackle it sooner rather than later and give a framework for it,” said Pichai. However, Brad Smith, chief legal officer of Microsoft, had a different opinion when asked about the EU’s move. He said it is a young technology that will only get better, but it will only improve if more people use it.


Amidst several concerns of facial recognition technology, the adoption has only grown across the world. More recently, London’s police on Friday announced that it would be deploying facial recognition to pinpoint crimes. The Metropolitan Police said that technology is crucial for identifying and acting on crimes and violence. Earlier the London police used to match the images with databases to determine suspects; however, with the new announcement, they will now automate the process with real-time detection.

Rise In The Adoption Of Facial Recognition Technology

Similar to the London police, the New York Police Department uses the image matching technique. According to a report, more than 600 law enforcement agencies have been leveraging facial recognition. In 2019, Kenya also announced the need for surveillance to enhance security, which led them to use facial recognition tech from Chinese firms. 

Although China is considered a significant driver of surveillance, other countries such as Japan, France, and the US are substantial contributors to the supply of surveillance technology. According to a report, US tech firms supply surveillance technology to thirty-two countries.

Contrary to the popular narrative, facial recognition technology has helped countries to maintain law and order. For one, Wales’ police got hold of 58 wanted people using facial recognition technology. Besides, in 2019 it was used in public places like airports in India to expedite the security check and eliminate the need for paper-based boarding passes. Further, the use of facial authentication for new mobile users will restrict the misuse of services.

Outlook

Facial recognition technology has various shortcomings which raise concerns about its adoption. Some of the most notable ones are bias, threats to privacy, and misidentifying people, to name a few. On numerous occasions, the technology has failed to work, thereby drawing people’s attention to its effectiveness.

Like any other technology, it is prone to various flaws, but this doesn’t mean we should ban it. Today, we embrace machine learning models in several use cases even though they are not cent per cent accurate. Similarly, a shortcoming shouldn’t be the reason for squashing the advantages that we can harness from the technology.

Undoubtedly, there is a need for regulation, but temporarily banning it for analysing its potential misuse can slacken the development of the technology. Consequently, we should embrace facial recognition and continue to improve it as we move forward.

Copyright Analytics India Magazine Pvt Ltd
